Dimension independent similarity computation
Abstract
We present a suite of algorithms for Dimension Independent Similarity Computation (DISCO) to compute all pairwise similarities between very high-dimensional sparse vectors. All of our results are provably independent of dimension, meaning that apart from the initial cost of trivially reading in the data, all subsequent operations are independent of the dimension; thus the dimension can be very large. We study the Cosine, Dice, Overlap, and Jaccard similarity measures. For Jaccard similarity we include an improved version of MinHash. Our results are geared toward the MapReduce framework. We empirically validate our theorems with large-scale experiments using data from the social networking site Twitter. At the time of writing, our algorithms are live in production at twitter.com.
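For readers unfamiliar with the four measures named in the abstract, the following is a minimal Python sketch of their standard definitions on sets (that is, sparse binary vectors represented by their nonzero indices), together with a textbook MinHash estimator for Jaccard similarity. All function names are hypothetical and chosen for illustration. The MinHash shown is the plain, widely known variant, not the improved version the abstract refers to, and nothing here reflects the paper's MapReduce algorithms or sampling scheme.

```python
import math
import random

# Standard set-based definitions (assumption: vectors are binary and
# represented as Python sets of their nonzero indices).

def cosine(a: set, b: set) -> float:
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def dice(a: set, b: set) -> float:
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

def overlap(a: set, b: set) -> float:
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def jaccard(a: set, b: set) -> float:
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Textbook MinHash (not the paper's improved variant).
# Elements are assumed to be integers in [0, _PRIME).
_PRIME = 2_147_483_647  # the Mersenne prime 2^31 - 1

def make_hashes(k: int, seed: int = 0) -> list:
    """Draw k random linear hash functions x -> (a*x + b) mod _PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, _PRIME), rng.randrange(0, _PRIME))
            for _ in range(k)]

def minhash_signature(items: set, hashes: list) -> list:
    """For each hash function, keep the minimum hash value over the set."""
    return [min((a * x + b) % _PRIME for x in items) for a, b in hashes]

def minhash_estimate(sig_a: list, sig_b: list) -> float:
    """Fraction of signature positions that agree; estimates Jaccard."""
    matches = sum(1 for u, v in zip(sig_a, sig_b) if u == v)
    return matches / len(sig_a)

if __name__ == "__main__":
    x, y = {1, 2, 3, 4}, {3, 4, 5}
    hashes = make_hashes(256)
    print("cosine ", cosine(x, y))
    print("dice   ", dice(x, y))
    print("overlap", overlap(x, y))
    print("jaccard", jaccard(x, y))
    print("minhash", minhash_estimate(minhash_signature(x, hashes),
                                      minhash_signature(y, hashes)))
```

The exact measures only need the set sizes and the intersection size, which is why the sketch never touches the ambient dimension. The MinHash estimate is randomized, so its value depends on the number of hash functions and the seed.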
Similar resources
Incremental All Pairs Similarity Search for Varying Similarity Thresholds with Reduced I/O Overhead
All Pairs Similarity Search (APSS) is the problem of finding all pairs of records with similarity scores above a specified threshold. Incremental All Pairs Similarity Search (IAPSS) is the problem of performing APSS multiple times over the same dataset by varying the similarity threshold. This problem is ubiquitous in many real-world systems like search engines, online social networks, and digi...
Notes on quantitative structure-properties relationships (QSPR) (1): A discussion on a QSPR dimensionality paradox (QSPR DP) and its quantum resolution
Classical quantitative structure-properties relationship (QSPR) statistical techniques unavoidably present an inherent paradoxical computational context. They rely on the definition of a Gram matrix in descriptor spaces, which is used afterwards to reduce the original dimension via several possible kinds of algebraic manipulations. From there, effective models for the computation of unknown pro...
Expectations on fractal sets
Using fractal self-similarity and functional-expectation relations, the classical theory of box integrals—being expectations on unit hypercubes—is extended to a class of fractal “string-generated Cantor sets” (SCSs) embedded in unit hypercubes of arbitrary dimension. Motivated by laboratory studies on the distribution of brain synapses, these SCSs were designed for dimensional freedom—a suitabl...
WaldHash: sequential similarity-preserving hashing
Similarity-sensitive hashing seeks compact representation of vector data as binary codes, so that the Hamming distance between code words approximates the original similarity. In this paper, we show that using codes of fixed length is inherently inefficient as the similarity can often be approximated well using just a few bits. We formulate a sequential embedding problem and approach similarity...
Retrieving Images by 2D Shape: A Comparison of Computation Methods with Human Perceptual Judgments
In content based image retrieval, systems allow users to ask for objects similar in shape to a query object. However, there is no clear understanding of how computational shape similarity corresponds to human shape similarity. In this paper several shape similarity measures were evaluated on planar, connected, non-occluded binary shapes. Shape similarity using algebraic moments, spline curve di...
Journal title:
- Journal of Machine Learning Research
Volume 14, Issue
Pages -
Publication date: 2013